Improving Text Normalization via Unsupervised Model and Discriminative Reranking

نویسندگان

  • Chen Li
  • Yang Liu
چکیده

Various models have been developed for normalizing informal text. In this paper, we propose two methods to improve normalization performance. First is an unsupervised approach that automatically identifies pairs of a non-standard token and proper word from a large unlabeled corpus. We use semantic similarity based on continuous word vector representation, together with other surface similarity measurement. Second we propose a reranking strategy to combine the results from different systems. This allows us to incorporate information that is hard to model in individual systems as well as consider multiple systems to generate a final rank for a test case. Both wordand sentence-level optimization schemes are explored in this study. We evaluate our approach on data sets used in prior studies, and demonstrate that our proposed methods perform better than the state-of-the-art systems.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Unsupervised training methods for discriminative language modeling

Discriminative language modeling (DLM) aims to choose the most accurate word sequence by reranking the alternatives output by the automatic speech recognizer (ASR). The conventional (supervised) way of training a DLM requires a large amount of acoustic recordings together with their manual reference transcriptions. These transcriptions are used to determine the target ranks of the ASR outputs, ...

متن کامل

Improving broadcast news transcription with a precision grammar and discriminative reranking

We propose a new approach of integrating a precision grammar into speech recognition. The approach is based on a novel robust parsing technique and discriminative reranking. By reranking 100-best output of the LIMSI German broadcast news transcription system we achieved a significant reduction of the word error rate by 9.6% relative. To our knowledge, this is the first significant improvement f...

متن کامل

Large-scale discriminative language model reranking for voice-search

We present a distributed framework for largescale discriminative language models that can be integrated within a large vocabulary continuous speech recognition (LVCSR) system using lattice rescoring. We intentionally use a weakened acoustic model in a baseline LVCSR system to generate candidate hypotheses for voice-search data; this allows us to utilize large amounts of unsupervised data to tra...

متن کامل

Survey on Three Reranking Models for Discriminative Parsing

This survey is inspired by the so-called reranking techniques in natural language processing (NLP). The aim of this survey is to provide an overview of three main reranking tasks particularly for discriminative parsing. We will focus on the motivation for discriminative reranking, on the three models, boosting model, support vector machine (SVM) model and voted perceptron model, on the procedur...

متن کامل

Discriminative codebook learning for Web image search

Given the explosive growth of the Web images, image search plays an increasingly important role in our daily lives. The visual representation of image is the fundamental factor to the quality of content-based image search. Recently, bag-of-visual word model has been widely used for image representation and has demonstrated promising performance in many applications. In the bag-of-visual-word mo...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2014